ACCEPTED PAPERS

  • Article Category: PAPERS
  • 11/30/2022

Paper ID Paper Title Authors
9 Transduce and Speak: Neural Transducer for Text-To-Speech with Semantic Token Prediction Minchan Kim (Seoul National University)*; Myeonghun Jeong (Seoul National University); Byoung Jin Choi (Seoul National University); Dongjune Lee (Seoul National University); Nam Soo Kim (Seoul National University)
11 Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-To-Sequence End-To-End Spoken Language Understanding Pavel Denisov (University of Stuttgart)*; Ngoc Thang Vu (University of Stuttgart)
12 LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models Chi-Chang Lee (Academia Sinica)*; Hong Wei Chen (National Taiwan University); Chu-Song Chen (National Taiwan University); Hsin-Min Wang (Academia Sinica); Tsung-Te Liu (National Taiwan University); Yu Tsao (Academia Sinica)
22 Variational Gaussian Process Data Uncertainty Jeremy H. M. Wong (Institute for Infocomm Research)*; Huayun Zhang (ASTAR); Nancy Chen (Institute for Infocomm Research)
26 Low-rank Adaptation of Neural Language Model Rescoring for Speech Recognition Yu Yu (Stevens Institute of Technology); Chao-Han Huck Yang (Amazon)*; Jari T Kolehmainen (Amazon); Prashanth Gurunath Shivakumar (Amazon); Yile Gu (Amazon); Sungho Ryu (Amazon); Roger Ren (Amazon); Qi Luo (Amazon.com Inc.); Aditya Gourav (Amazon); I-Fan Chen (Amazon Inc.); Yi Chieh Liu (Amazon); Tuan Dinh (Amazon); Denis Filimonov (Amazon); Ankur Gandhe (Amazon Alexa); Andreas Stolcke (Amazon); Ariya Rastrow (Amazon Alexa); Ivan Bulyko (Amazon)
27 CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers Xintong Wang (XiaoIce); Chang Zeng (National Institute of Informatics)*; Jun Chen (Tsinghua University); Wang Chun Hui (XiaoIce)
30 Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition Dongyuan Li (Tokyo Institute of Technology)*; Yusong Wang (Tokyo Institute of Technology); Kotaro Funakoshi (Tokyo Institute of Technology); Manabu Okumura (Tokyo Institute of Technology)
32 Identifying People with Mild Cognitive Impairment At Risk of Developing Dementia Using Speech Analysis Bahman Mirheidari (University of Sheffield)*; Ronan O’Malley (University of Sheffield); Daniel Blackburn (University of Sheffield); Heidi Christensen (University of Sheffield)
36 BiSinger: Bilingual Singing Voice Synthesis Huali Zhou (Wuhan University); Yueqian Lin (Duke Kunshan University); Yao Shi (Duke Kunshan University); Peng Sun (Duke Kunshan University); Ming Li (Duke Kunshan University)*
44 Robust Recognition of Speaker Emotion with Difference Feature Extraction Using a Few Enrollment Utterances Daichi Hayakawa (Toshiba Corporation Corporate R&D Center)*; Takehiko Kagoshima (Toshiba Corporation Corporate R&D Center); Kenji Iwata (Toshiba Corporation Corporate R&D Center); Rama S Doddipatla (Toshiba Europe LTD); Norbert Braunschweiler (Toshiba Europe Limited)
47 Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking Jihyun Lee (Pohang University of Science and Technology)*; Yejin Jeon (POSTECH); Wonjun Lee (POSTECH); Yunsu Kim (POSTECH); Gary Geunbae Lee (POSTECH)
48 Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition Yuang Li (University of Cambridge)*; Yu Wu (Microsoft Research Asia); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia)
50 MBTFNet: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement Weiming Xu (Northwestern Polytechnical University)*; Xuanzhou Chen (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Zhili Tan (Tencent); Shubo Lv (Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University); Runduo Han (Northwestern Polytechnical University); Wenjiang Zhou (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Weifeng Zhao (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Lei Xie (NWPU)
51 Can We Use Speaker Embeddings on Spontaneous Speech Obtained From Medical Conversations to Predict Intelligibility? Sebastião Quintas (IRIT, Université de Toulouse, CNRS, Toulouse, France)*; Mathieu Balaguer (IRIT); Julie Mauclair (IRIT); Virginie Woisard (Hospitals of Toulouse); Julien Pinquier (IRIT)
53 End-to-End Training of a Neural HMM with Label and Transition Probabilities Daniel Mann (RWTH Aachen University)*; Tina Raissi (RWTH Aachen University); Wilfried Michel (AppTek); Ralf Schlüter (RWTH Aachen University); Hermann Ney (RWTH Aachen University)
54 Wiki-En-ASR-Adapt: Large-Scale Synthetic Dataset for English ASR Customization Alexandra A Antonova (Moscow Institute of Physics and Technology)*
58 The Role of Feature Correlation on Quantized Neural Networks David Qiu (Google)*; Shaojin Ding (Google); Yanzhang He (Google)
59 LV-CTC: Non-autoregressive ASR With CTC and Latent Variable Models Yuya Fujita (Yahoo Japan Corporation)*; Shinji Watanabe (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Takashi Maekaku (Yahoo Japan Corporation)
64 The Singing Voice Conversion Challenge 2023 Wen-Chin Huang (Nagoya University)*; Lester Phillip G Violeta (Nagoya University); Songxiang Liu (Tencent); Jiatong Shi (Carnegie Mellon University); Tomoki Toda (Nagoya University)
65 Improving Multilingual and Code-switching ASR using Large Language Model Generated Text Ke Hu (Google)*; Tara Sainath (Google); Bo Li (Google); Yu Zhang (Google); Yong Cheng (Google); Tao Wang (Google Inc.); Yujing Zhang (Google); Frederick Liu (Google Inc.)
66 Pareto Efficiency of Learning-Forgetting Trade-Off in Neural Language Model Adaptation Jerome R Bellegarda (Apple)*
72 Improved Multi-modal Emotion Recognition using Squeeze-and-Excitation Block in Cross-Modal Attention Junchen Liu (The University of Auckland)*; Jesin James (The University of Auckland); Karan Nathwani (Indian Institute of Technology, Jammu)
73 Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer Jin Qiu (ByteDance); Lu Huang (ByteDance)*; Boyu Li (ByteDance); Jun Zhang (Bytedance); Lu Lu (Bytedance); Zejun Ma (Bytedance)
76 Locality Enhanced Dynamic Biasing and Sampling Strategies for Contextual ASR Md Asif Jalal (Samsung Research UK)*; Pablo Peso Parada (Samsung Research UK); George Pavlidis (Information Technologies Institute, Centre for Research and Technology); Vasileios Moschopoulos (Information Technologies Institute, Centre for Research and Technology - Hellas, Thessaloniki, Greece); Karthikeyan Saravanan (Samsung Research, UK); Chrysovalantis G Kontoulis (Pragma-IoT); Jisi Zhang (Samsung Research UK); Anastasios Drosou (Information Technologies Institute, Centre for Research and Technology - Hellas, Thessaloniki, Greece); Jung In Lee (Samsung Electronics); Gil Ho Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics)
77 Robust End-To-End Diarization with Domain Adaptive Training and Multi-Task Learning Ivan Fung (Fano Labs)*; Lahiru T Samarakoon (Fano Labs, Hong Kong); Samuel J Broughton (Fano Labs)
78 Whisper-SLU: Extending a Pretrained Speech-to-Text Transformer for Low Resource Spoken Language Understanding Quentin Meeus (KU Leuven)*; Sien Moens (KU Leuven); Hugo Van hamme (KU Leuven)
80 Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that leverages a Universal Speech Model Hagen Soltau (Google)*; Izhak Shafran (Google AI); Alex Ottenwess (Google); Joseph R. JR Duffy (Mayo Clinic); Rene L Utianski (Mayo Clinic); Leland R. Barnard (Mayo); John L. Stricker (Mayo Clinic); Daniela Wiepert (Mayo Clinic); David T. Jones (Mayo Clinic); Hugo Botha (Mayo Clinic)
82 Contextual Spelling Correction With Large Language Models Gan Song (Google)*; Zelin Wu (Google LLC); Golan Pundak (Google); Angad Chandorkar (Google); Xavier Velez (Google); Diamantino Caseiro (Google); Ben Haynor (Google); Weiran Wang (Google); Nikhil Siddhartha (Google); Kandarp Joshi (Google); Pat Rondon (Google); Khe C Sim (Google Inc.)
83 Not All Errors Are Created Equal: Evaluating The Impact of Model And Speaker Factors on ASR Outcomes in Clinical Populations Daniela Wiepert (Mayo Clinic)*; Rene L Utianski (Mayo Clinic); Joseph Duffy (Mayo Clinic); John Stricker (Mayo Clinic); Leland Barnard (Mayo Clinic); Keith Josephs (Mayo Clinic Rochester); Jennifer Whitwell (Mayo Clinic Rochester); David Jones (Mayo Clinic); Hugo Botha (Mayo Clinic)
85 The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning Lillian Zhou (Google)*; Yuxin Ding (Google); Mingqing Chen (Google Inc.); Harry Zhang (Google); Rohit Prabhavalkar (Google); Dhruv Guliani (Google); Giovanni Motta (Google, Inc.); Rajiv Mathews (Google)
90 FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition Dongning Yang (Shanghai Jiao Tong University)*; Wei Wang (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University)
96 Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition Dongji Gao (Johns Hopkins University)*; Hainan Xu (NVIDIA); Desh Raj (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Daniel Povey (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University)
97 Acoustics-Text Dual-Modal Joint Representation Learning for Cover Song Identification Yanmei Gu (AntGroup)*; Li Jing (AntGroup); Zhou Jiayi (AntGroup); Wang Zhiming (AntGroup); Zhu Huijia (AntGroup)
99 Towards Matching Phones and Speech Representations Gene-Ping Yang (The University of Edinburgh)*; Hao Tang (The University of Edinburgh)
101 RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain Sangeet Sagar (Saarland University)*; Mirco Ravanelli (Université de Montréal); Bernd Kiefer (DFKI); Ivana Kruijff (DFKI); Josef van Genabith (Saarland University)
103 Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder Yosuke Higuchi (Waseda University)*; Andrew Rosenberg (Google LLC); Yuan Wang (Google); Murali Karthick Baskar (Google Inc); Bhuvana Ramabhadran (Google)
105 ED-CEC: Improving Rare Word Recognition Using ASR Post-Processing Based on Error Detection and Context-Aware Error Correction Jiajun He (Nagoya University)*; Zekun Yang (Nagoya University); Tomoki Toda (Nagoya University)
108 Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning Guanrou Yang (Shanghai Jiao Tong University)*; Xie Chen (Shanghai Jiaotong University); Ziyang Ma (Shanghai Jiao Tong University); Zhisheng Zheng (Shanghai Jiao Tong University); Yakun Song (Shanghai Jiao Tong University); Zhikang Niu (Xidian University)
109 Can Unpaired Textual Data Replace Synthetic Speech In ASR Model Adaptation? Pasquale D'Alterio (Amazon)*; Christian Hensel (Amazon); Bashar Awwad Shiekh Hasan (Amazon)
110 Preserving Phonemic Distinctions For Ordinal Regression: A Novel Loss Function For Automatic Pronunciation Assessment Bi-Cheng Yan (National Taiwan Normal University)*; Hsin-Wei Wang (NTNU); Yi-Cheng Wang (National Taiwan Normal University); Jiun-Ting Li (National Taiwan Normal University); Chi-Han Lin (E.SUN Financial Holding Co., Ltd.); Berlin Chen (National Taiwan Normal University)
111 Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition Yujin Wang (Tsinghua University); Changli Tang (Tsinghua University)*; Ziyang Ma (Shanghai Jiao Tong University); Zhisheng Zheng (Shanghai Jiao Tong University); Xie Chen (Shanghai Jiaotong University); Wei-Qiang Zhang (Tsinghua University)
112 Efficient Cascaded Streaming ASR System via Frame Rate Reduction Xingyu Cai (Google)*; David Qiu (Google); Shaojin Ding (Google); Dongseong Hwang (Google); Weiran Wang (Google); Antoine Bruguier (Google); Rohit Prabhavalkar (Google); Tara Sainath (Google); Yanzhang He (Google)
119 VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention Yuewei Zhang (Shanghai Jiao Tong University)*; Huanbin Zou (Tencent); Jie Zhu (Shanghai Jiao Tong University)
123 Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization Wei-Ping Huang (National Taiwan University)*; Sung-Feng Huang (National Taiwan University); Hung-yi Lee (National Taiwan University)
124 Meta-Learning Framework for End-To-End Imposter Identification in Unseen Speaker Recognition Ashutosh Chaubey (LG Ad Solutions); Sparsh Sinha (LG Ad Solutions)*; Susmita Ghose (LG Ad Solutions)
127 Using Joint Training Speaker Encoder with Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion Houjian Guo (Osaka University, RIKEN Guardian Robot Group); Chaoran Liu (RIKEN)*; Carlos T Ishi (RIKEN); Hiroshi Ishiguro (Osaka University)
128 QuickVC: A Lightweight VITS-Based Any-To-Many Voice Conversion Model Using iSTFT for Faster Conversion Houjian Guo (Osaka University, RIKEN Guardian Robot Group); Chaoran Liu (RIKEN)*; Carlos T Ishi (RIKEN); Hiroshi Ishiguro (Osaka University)
130 Multi Transcription-Style Speech Transcription Using Attention-Based Encoder-Decoder Model Yan Huang (Microsoft Research)*; Piyush Behre (Microsoft); Guoli Ye (Microsoft); Shawn Chang (); Yifan Gong (Microsoft)
134 NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation Yixuan Zhang (The Ohio State University)*; Meng Yu (Tencent); Hao Zhang (Tencent AI Lab); Dong Yu (Tencent AI Lab); DeLiang Wang (Ohio State University)
135 Thai-Dialect: Low Resource Thai Dialectal Speech to Text Corpora Artit Suwanbandit (Chulalongkorn University)*; Jaturong Chitiyaphol (Khon Kaen University); Sutthinan Chuenchom (Chiang Mai Rajabhat University); Kanyarat Kwiecien (Khon Kaen University); Husen Sawal (Prince of Songkla University); Ruslan Uthai (Prince of Songkla University); Orathai Sangpetch (CMKL University); Ekapol Chuangsuwanich (Chulalongkorn University)
140 Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings Hao Zhang (Tencent AI Lab)*; Meng Yu (Tencent); Dong Yu (Tencent AI Lab)
142 On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments William Ravenscroft (The University of Sheffield)*; Stefan Goetze (University of Sheffield); Thomas Hain (University of Sheffield)
143 NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network for Acoustic Echo Cancellation and Speech Enhancement Meng Yu (Tencent)*; Yong Xu (Tencent); Chunlei Zhang (Tencent AI Lab); Shixiong Zhang (Tencent AI Lab); Dong Yu (Tencent AI Lab)
144 Combining relative and absolute learning formulations to predict emotional attributes from speech Abinay Reddy Naini (The University of Texas at Dallas); Shruthi Subramanium (The University of Texas at Dallas); Seong-Gyun Leem (University of Texas at Dallas); Carlos Busso (University of Texas at Dallas)*
145 ESPNet-SUMM: Introducing a novel large dataset, toolkit, and a cross-corpora evaluation of speech summarization systems Roshan S Sharma (Carnegie Mellon University)*; William Chen (Carnegie Mellon University); Takatomo Kano (NTT Corporation); Ruchira S Sharma (University of Massachusetts, Amherst); Atsunori Ogawa (NTT Corporation); Siddhant Arora (Carnegie Mellon University); Marc Delcroix (NTT); Rita Singh (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Bhiksha Raj (Carnegie Mellon University)
146 Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data Yifan Peng (Carnegie Mellon University)*; Jinchuan Tian (Carnegie Mellon University); Brian Yan (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Xinjian Li (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University); Siddhant Arora (Carnegie Mellon University); William Chen (Carnegie Mellon University); Roshan S Sharma (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiao Tong University); Yui Sudo (Honda Research Institute Japan); Muhammad Mr. Shakeel (Honda Research Institute Japan); Jee-weon Jung (Carnegie Mellon University); Soumi Maiti (CMU); Shinji Watanabe (Carnegie Mellon University)
155 Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference Masao Someki (IBM)*; Nicholas Eng (The University of Auckland); Yosuke Higuchi (Waseda University); Shinji Watanabe (Carnegie Mellon University)
156 Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning William Chen (Carnegie Mellon University)*; Jiatong Shi (Carnegie Mellon University); Brian Yan (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiao Tong University); Yifan Peng (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Soumi Maiti (CMU); Shinji Watanabe (Carnegie Mellon University)
163 Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond Jiatong Shi (Carnegie Mellon University)*; William Chen (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Hsiu-Hsuan Wang (National Taiwan University); Wei Ping Huang (National Taiwan University); En Pei Hu (National Taiwan University); Ho Lam Chung (National Taiwan University); Xuankai Chang (Carnegie Mellon University); Yuxun Tang (Renmin University of China); Shang-Wen Li (Meta AI); Abdelrahman Mohamed (Rembrand Inc); Hung-yi Lee (National Taiwan University); Shinji Watanabe (Carnegie Mellon University)
165 Diffusion-Based Mel-Spectrogram Enhancement For Personalized Speech Synthesis With Found Data Yusheng Tian (The Chinese University of Hong Kong)*; Wei Liu (The Chinese University of Hong Kong); Tan Lee (The Chinese University of Hong Kong)
166 SQAT-LD: Speech Quality Assessment Transformer Utilizing Listener Dependent Modeling For Zero-Shot Out-Of-Domain MOS Prediction Kailai Shen (Ningbo University); Diqun Yan (Ningbo University)*; Li Dong (Ningbo University); Ren Ying (Ningbo University); Xiaoxun Wu (Ningbo University); Jing Hu (Ningbo University)
170 Scenario-Aware Audio-Visual TF-GridNet For Target Speech Extraction Zexu Pan (National University of Singapore)*; Gordon Wichern (Mitsubishi Electric Research Laboratories (MERL)); Yoshiki Masuyama (Tokyo Metropolitan University); François G Germain (Mitsubishi Electric Research Laboratories (MERL)); Sameer Khurana (Mitsubishi Electric Research Lab); Chiori Hori (Mitsubishi Electric Research Laboratories (MERL)); Jonathan LeRoux (Mitsubishi Electric Research Laboratories (MERL))
171 Generative ASR Error Correction With Large Language Models Chao-Han Huck Yang (Amazon)*; Yile Gu (Amazon.com, USA); Yi-Chieh Liu (Georgia Institute of Technology); Shalini Ghosh (Amazon Alexa AI); Ivan Bulyko (Amazon); Andreas Stolcke (Amazon)
172 Enhancing Task-Oriented Dialogues With Chitchat: A Comparative Study Based On Lexical Diversity And Divergence Armand Stricker (LISN, CNRS)*; Patrick Paroubek (LISN)
174 Token-Level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments Sara Papi (FBK)*; Peidong Wang (Microsoft); Junkun Chen (Microsoft); Jian Xue (Microsoft Corporation); Jinyu Li (Microsoft); Yashesh Gaur (Microsoft)
175 LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task For E2E Code-Switching ASR Guodong Ma (NetEase Yidun AI Lab)*; Wenxuan Wang (NetEase Yidun AI Lab); Yuke Li (NetEase Yidun AI Lab); Yuting Yang (NetEase Yidun AI Lab); Binbin Du (NetEase Yidun AI Lab); Haoran Fu (Department of Civil Engineering, Zhejiang University)
177 A Token-Wise Beam Search Algorithm For RNN-T Gil Keren (Facebook)*
181 Joint Federated Learning And Personalization For On-Device ASR Junteng Jia (Meta AI)*; Ke Li (Johns Hopkins University); Mani Malek (Meta); Kshitiz Malik (Meta); Jay Mahadeokar (Meta AI); Ozlem Kalinli (Meta); Frank Seide (Meta AI)
183 MelHuBERT: A Simplified HuBERT On Mel Spectrograms Tzu-Quan Lin (National Taiwan University)*; Hung-yi Lee (National Taiwan University); Hao Tang (The University of Edinburgh)
185 Exploring Data Augmentation In Bias Mitigation Against Non-Native-Accented Speech Yuanyuan Zhang (Technische Universiteit Delft)*; Aaricia Herygers (-); Tanvina Patel (Multimedia Computing, Delft University of Technology); Zhengjun Yue (Technische Universiteit Delft); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology)
190 AWMC: Online Test-Time Adaptation Without Mode Collapse For Continual Adaptation Jae-Hong Lee (Hanyang University)*; Dohee Kim (Hanyang University); Joon-Hyuk Chang (Hanyang University)
192 LE-SSL-MOS: Self-Supervised Learning MOS Prediction With Listener Enhancement Zili Qi (Hithink RoyalFlush AI Research Institute)*; Xinhui Hu (Hithink RoyalFlush AI Research Institute); Wangjin Zhou (Kyoto University); Sheng Li (National Institute of Information & Communications Technology (NICT)); Hao Wu (Hithink RoyalFlush AI Research Institute); Jian Lu (Hithink RoyalFlush AI Research Institute); Xinkang Xu (Hithink RoyalFlush AI Research Institute)
198 Transcribing And Aligning Conversational Speech: A Hybrid Pipeline Applied To French Conversations Hiroyoshi Yamasaki (Aix-Marseille University); Jérôme Louradour (Linagora); Julie Hunter (LINAGORA); Laurent Prevot (Aix Marseille Université & CNRS)*
201 FedCPC: An Effective Federated Contrastive Learning Method For Privacy Preserving Early-Stage Alzheimer’s Speech Detection Wenqing Wei (Japan Advanced Institute of Science and Technology); Zhengdong Yang (Kyoto University); Gao Yuan (Japan Advanced Institute of Science and Technology); Jiyi Li (University of Yamanashi); Chenhui Chu (Kyoto University); Shogo Okada (Japan Advanced Institute of Science and Technology); Sheng Li (National Institute of Information & Communications Technology (NICT))*
204 Toward General-Purpose Text-Instruction-Guided Voice Conversion Chun-Yi Kuan (National Taiwan University)*; Chen An Li (National Taiwan University); Tsu-Yuan Hsu (National Taiwan University); Tse-Yang Lin (National Taiwan University); Ho Lam Chung (National Taiwan University); Kai-Wei Chang (National Taiwan University); Shuo-yiin Chang (Google); Hung-yi Lee (National Taiwan University)
206 AV-data2vec: Self-Supervised Learning Of Audio-Visual Speech Representations With Contextualized Target Representations Jiachen Lian (University of California Berkeley)*; Alexei Baevski (Facebook AI Research); Wei-Ning Hsu (Meta); Michael Auli (Meta)
207 Improving Stability In Simultaneous Speech Translation: A Revision-Controllable Decoding Approach Junkun Chen (Microsoft)*; Jian Xue (Microsoft Corporation); Peidong Wang (Microsoft); Jing Pan (Microsoft); Jinyu Li (Microsoft)
211 Haha-Pod: An Attempt For Laughter-Based Non-Verbal Speaker Verification Yuke Lin (Wuhan University); Xiaoyi Qin (Duke Kunshan University); Ning Jiang (Mashang Consumer Finance Co., Ltd.); Guoqing Zhao (Mashang Consumer Finance Co., Ltd); Ming Li (Duke Kunshan University)*
220 PP-MeT: A Real-World Personalized Prompt Based Meeting Transcription System Xiang Lyu (Ximalaya)*; Yuhang Cao (Ximalaya); Qing Wang (Ximalaya); Jingjing Yin (Ximalaya); Yuguang Yang (Ximalaya Inc., Shanghai, China); Pengpeng Zou (Ximalaya); Xuecheng Hu (Ximalaya); Yanni Hu (Ximalaya); Heng Lu (Ximalaya)
223 Brouhaha: Multi-Task Training For Voice Activity Detection, Speech-To-Noise Ratio, And C50 Room Acoustics Estimation Marvin Lavechin (ENS, Meta AI)*; Marianne Metais (ENS); Hadrien Titeux (ENS); Alodie Boissonnet (Meta AI); Jade Copet (Meta AI); Morgane Riviere (Meta AI); Elika Bergelson (Duke University); Alejandrina Cristia (Exelang, CNRS, LSCP); Emmanuel Dupoux (EHESS, ENS, PSL University, CNRS, INRIA, META); Hervé Bredin (CNRS)
224 Magnitude-And-Phase-Aware Speech Enhancement With Parallel Sequence Modeling Yuewei Zhang (Shanghai Jiao Tong University)*; Huanbin Zou (Tencent); Jie Zhu (Shanghai Jiao Tong University)
228 Speaker Adaptation For End-To-End Speech Recognition Systems In Noisy Environments Dominik Wagner (Technische Hochschule Nuernberg Georg Simon Ohm)*; Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm); Sebastian P Bayerl (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Tobias Bocklet (TH Nürnberg)
233 Improving Severity Preservation of Healthy-To-Pathological Voice Conversion With Global Style Tokens Bence Halpern (Netherlands Cancer Institute)*; Wen-Chin Huang (Nagoya University); Lester Phillip G Violeta (Nagoya University); Rob van Son (Netherlands Cancer Institute); Tomoki Toda (Nagoya University)
235 End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis Can Cui (Inria)*; Imran Sheikh (Vivoka); Mostafa Sadeghi (INRIA); Emmanuel Vincent (Inria)
238 GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition Daniel Galvez (NVIDIA)*; Tim Kaldewey (NVIDIA)
239 Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition Hillary Ngai (Google)*; Rohan Agrawal (Google); Parisa Haghani (Google); Pedro J Moreno (Google); W. Ronny Huang (Google); Neeraj Gaur (Google)
243 CAMSAT: Augmentation Mix and Self-Augmented Training Clustering for Self-Supervised Speaker Recognition Abderrahim Fathan (Computer Research Institute of Montreal (CRIM), Montreal, Quebec, Canada)*; Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada)
244 Toward Universal Speech Enhancement For Diverse Input Conditions Wangyou Zhang (Shanghai Jiao Tong University)*; Kohei Saijo (Waseda University); Zhong-Qiu Wang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University)
245 Adversarial Augmentation for Adapter Learning Jen-Tzung Chien (National Yang Ming Chiao Tung University)*; Wei-Yu Sun (National Yang Ming Chiao Tung University)
246 Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition And Phoneme To Grapheme Translation Wonjun Lee (POSTECH)*; Yunsu Kim (POSTECH); Gary Geunbae Lee (POSTECH)
248 CTC Blank Triggered Dynamic Layer-Skipping For Efficient CTC-Based Speech Recognition Junfeng Hou (Netease)*; Peiyao Wang (Netease); Jincheng Zhang (Netease); Meng Yang (Netease); Minwei Feng (Netease); Jingcheng Yin (Netease)
252 Prompt Pool Based Class-Incremental Continual Learning for Dialog State Tracking Hong Liu (Tsinghua University)*; Yucheng Cai (Tsinghua University); Yuan Zhou (None); Zhijian Ou (Tsinghua University); Yi Huang (China Mobile Research); Junlan Feng (China Mobile Research)
253 Model-based Fairness Metric for Speaker Verification Maliha Jahan (Johns Hopkins University)*; Laureano Moro-Velazquez (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Najim Dehak (Johns Hopkins University); Jesus Antonio Villalba (Johns Hopkins University)
258 The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains Erica Cooper (National Institute of Informatics)*; Wen-Chin Huang (Nagoya University); Yu Tsao (Academia Sinica); Hsin-Min Wang (Academia Sinica); Tomoki Toda (Nagoya University); Junichi Yamagishi (National Institute of Informatics)
263 Cross-Modal Alignment with Optimal Transport for CTC-Based ASR Xugang Lu (NICT)*; Peng Shen (NICT); Yu Tsao (Academia Sinica); Hisashi Kawai (NICT)
264 Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility Hsin-Tien Chiang (Academia Sinica); Kuo-Hsuan Hung (Academia Sinica); Szu-Wei Fu (NVIDIA); Heng-Cheng Kuo (Academia Sinica); Ming-Hsueh Tsai (National Academy for Educational Research); Yu Tsao (Academia Sinica)*
265 Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model Kai-Wei Chang (National Taiwan University)*; Ming-Hsin Chen (National Taiwan University); Yun-Ping Lin (National Taiwan University); Jing Neng Hsu (National Taiwan University); Paul KM Huang (NTU); Chien-yu Huang (National Taiwan University); Shang-Wen Li (FAIR); Hung-yi Lee (National Taiwan University)
266 VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model Yayun He (Ping An Technology (Shenzhen) Co., Ltd)*; Zuheng Kang (Ping An Technology (Shenzhen) Co., Ltd); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd); Junqing Peng (Ping An Technology (Shenzhen) Co., Ltd); Jing Xiao (Ping An Insurance (Group) Company of China)
268 Zero-Shot Singing Voice Synthesis From Musical Score Jun-You Wang (National Taiwan University)*; Hung-yi Lee (National Taiwan University); Roger Jang (); Li Su (Academia Sinica)
271 PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models Robin Netzorg (UC Berkeley)*; Ajil Jalal (UC Berkeley); Luna McNulty (Brown University); Gopala Krishna Anumanchipalli (UC Berkeley)
272 Boosting Modality Representation with Pre-trained Models and Multi-task Training for Multimodal Sentiment Analysis Jiarui Hai (Johns Hopkins University)*; Yu-Jeh Liu (Johns Hopkins University); Mounya Elhilali (Johns Hopkins University)
274 Efficient Text-Only Domain Adaptation for CTC-based ASR Chang Chen (Shanghai Jiao Tong University); Xun Gong (Shanghai Jiaotong University)*; Yanmin Qian (Shanghai Jiao Tong University)
275 Adapting Pretrained Speech Model For Mandarin Lyrics Transcription And Alignment Jun-You Wang (National Taiwan University)*; Chon In Leong (National Taiwan University); Yu-Chen Lin (National Taiwan University); Li Su (Academia Sinica); Roger Jang ()
276 Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-Supervised Setting Hemant Yadav (IIIT Delhi)*; Erica Cooper (National Institute of Informatics); Junichi Yamagishi (National Institute of Informatics); Sunayana Sitaram (Microsoft Research); Rajiv Ratn Shah (IIIT Delhi)
282 Coco-Nut: Corpus Of Japanese Utterance And Voice Characteristics Description For Prompt-Based Control Aya Watanabe (The University of Tokyo)*; Shinnosuke Takamichi (The University of Tokyo); Yuki Saito (The University of Tokyo, Japan); Wataru Nakata (The University of Tokyo); Detai Xin (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo)
284 Generative Linguistic Representation For Spoken Language Identification Peng Shen (NICT)*; Xugang Lu (NICT); Hisashi Kawai (NICT)
287 Spike-Triggered Contextual Biasing For End-To-End Mandarin Speech Recognition Kaixun Huang (NWPU)*; Ao Zhang (Northwestern Polytechnical University); Binbin Zhang (Horizon Robotics); Tianyi Xu (NWPU); Xingchen Song (Tsinghua University); Lei Xie (NWPU)
290 Towards Robust Packet Loss Concealment System With Asr-Guided Representations Da-Hee Yang (Hanyang University); Joon-Hyuk Chang (Hanyang University)*
294 U2-Kws: Unified Two-Pass Open-Vocabulary Keyword Spotting With Keyword Bias Ao Zhang (Northwestern Polytechnical University)*; Pan Zhou (Li Auto Inc.); Kaixun Huang (NWPU); Yong Zou (Li Auto Inc. ); Ming Liu (Li Auto Inc.); Lei Xie (NWPU)
295 Consistency Based Unsupervised Self-Training For Asr Personalisation Jisi Zhang (Samsung Research UK)*; Vandana Rajan (Samsung Research UK); Haaris Mehmood (Samsung Research UK); David Tuckey (Samsung Research UK); Pablo Peso Parada (Samsung Research UK); Md Asif Jalal (Samsung Research UK); KARTHIKEYAN SARAVANAN (Samsung Research, UK); Gil Ho Lee (Samsung Electronics); Jung In Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics)
299 Towards A Unified End-To-End Language Understanding System For Speech And Text Inputs Mohan LI (Toshiba Europe Ltd)*; Catalin Zorila (Toshiba Cambridge Research Lab); Cong-Thanh Do (Toshiba Research Europe Ltd.); Rama S Doddipatla (Toshiba Europe LTD)
301 On Decoder-only Architecture for Speech-to-text and Large Language Model Integration Jian Wu (Microsoft)*; Yashesh Gaur (Microsoft); Zhuo Chen (Microsoft); Long Zhou (Microsoft Research Asia); Yimeng Zhu (Microsoft China); Tianrui Wang (Microsoft Research Asia ); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia); Bo Ren (Microsoft); Linquan Liu (Microsoft China); Yu Wu (Microsoft Research Asia)
302 Paraconsistent Feature Analysis For the Competency Evaluation of Voice Impersonation Rajeev Rajan (Government Engineering College, Barton Hill, Trivandrum)*; Noumida A (College Of Engineering Trivandrum); Sreelakshmi S (Government Engineering College, Barton Hill)
303 Knowledge Distillation from Offline to Streaming Transducer: Toward Accurate and Fast Streaming Model by Matching Alignments Ji-Hwan Mo (Hanyang University); Jae-Jin Jeon (Kakao Enterprise Corporation); MUNHAK LEE (Hanyang University); Joon-Hyuk Chang (Hanyang University)*
304 Transformer Attractors for Robust and Efficient End-to-end Neural Diarization Lahiru T Samarakoon (Fano Labs, Hong Kong)*; Samuel J Broughton (Fano Labs); Marc Härkönen (Fano Labs); Ivan Fung (Fano Labs)
308 Detection of Vowel Errors in Children's Speech Using Synthetic Phonetic Transcripts Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm)*; Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Elmar Noeth (Friedrich-Alexander-Universität Erlangen-Nürnberg); Tobias Bocklet (TH Nürnberg)
312 Invert-Classify: Recovering Discrete Prosody Inputs for Text-to-Speech Nicholas J Sanders (University of Edinburgh)*; Korin Richmond (University of Edinburgh)
313 KAQ: A Non-Intrusive Stacking Framework for Mean Opinion Score Prediction with Multi-Task Learning Chenglin Xu (Kuaishou Technology)*; Xiguang Zheng (北京达佳互联信息技术有限公司); Chen Zhang (北京达佳互联信息技术有限公司); Chao Zhou (Kuaishou Technology); Qi Huang (Kuaishou Technology); Bing Yu (Kuaishou Technology)
316 SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR Yangze Li (Northwestern Polytechnical University)*; Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yuhao Liang (Northwestern Polytechnical University); Pengcheng Guo (Northwestern Polytechnical University); Mohan Shi (University of Science and Technology of China); Zhihao Du (Speech Lab of DAMO Academy, Alibaba Group); Shiliang Zhang (Alibaba Group); Lei Xie (Northwestern Polytechnical University)
318 Simulation of Teacher-Learner Interaction in English Language Pronunciation Learning Elaf Islam (The University of Sheffield)*; Thomas Hain (University of Sheffield); Protima Nomo Sudro (University of Sheffield)
320 Ending The Blind Flight: Analyzing The Impact of Acoustic And Lexical Factors on Wav2Vec 2.0 in Air-Traffic Control Alexander Blatt (Saarland University)*; Badr Abdullah (Saarland University); Dietrich Klakow (Saarland University)
323 Cross-modal learning for CTC-based ASR: Leveraging CTC-BERTScore and sequence-level training MUNHAK LEE (Hanyang University); Sang-Eon Lee (Hanyang University); Jieun Choi (Hanyang University); Joon-Hyuk Chang (Hanyang University)*
324 Clustering Unsupervised Representations As Defense Against Poisoning Attacks on Speech Commands Classification System Thomas Thebaud (Johns Hopkins University)*; Sonal Joshi (Johns Hopkins University); Henry Li Xinyuan (Johns Hopkins University); Martin Sustek (Johns Hopkins University); Jesús Antonio Villalba López (Johns Hopkins University (JHU)); Sanjeev Khudanpur (Johns Hopkins University); Najim Dehak (Johns Hopkins University)
326 ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings Jenthe Thienpondt (IDLab, Ghent University)*; Kris Demuynck (Ghent University)
327 Multitask Learning Model with Text And Speech Representation for Fine-Grained Speech Scoring Seongjin Park (Educational Testing Service)*; Rutuja Ubale (Educational Testing Service Research)
330 LibriSpeech-PC: Benchmark For Evaluation Of Punctuation And Capitalization Capabilities Of End-To-End ASR Models Aleksandr Meister (NVIDIA)*; Matvei Novikov (NVIDIA); Nikolay Karpov (NVIDIA); Evelina Bakhturina (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA)
331 Evaluating Self-Supervised Speech Models on A Taiwanese Hokkien Corpus Yi-Hui Chou (Carnegie Mellon University)*; Kalvin Chang (Carnegie Mellon University); Meng-Ju Wu (N/A); Winston Ou (Scripps College); Alice Wen-Hsin Bi (University of Maryland); Carol Yang (N/A); Bryan Y. Chen (Swarthmore College); Rong-Wei Pai (National Taiwan Normal University); Po-Yen Yeh (China Medical University, Taiwan); Jo-Peng Chiang (National Taiwan University); Iu-Tshiann Phoann (N/A); Winnie Chang (Carnegie Mellon University); Chenxuan Cui (Carnegie Mellon University); Noel Chen (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University)
332 Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition Dima Rekesh (Nvidia)*; Nithin Rao Koluguri (NVIDIA); Samuel Kriman (NVIDIA); Somshubra Majumdar (NVIDIA); Vahid Noroozi (NVIDIA); He Huang (NVIDIA); Oleksii Hrinchuk (NVIDIA); Krishna C Puvvada (NVIDIA); Ankur Kumar (UCLA); Jagadeesh Balam (NVIDIA); Boris Ginsburg (NVIDIA)
333 Parameter-Efficient Cross-Language Transfer Learning For A Language-Modular Audiovisual Speech Recognition Zhengyang Li (Technische Universität Carolo-Wilhelmina Braunschweig)*; Thomas Graave (Technische Universität Carolo-Wilhelmina Braunschweig); Jing Liu (Amazon.com); Timo Lohrenz (Technische Universität Carolo-Wilhelmina Braunschweig); Siegfried Kunzmann (Amazon.com); Tim Fingscheidt (Technische Universität Braunschweig)
336 Generalized Zero-Shot Audio-to-Intent Classification Veera Raghavendra Elluru (AWS AI Labs)*; Devang Kulshreshtha (Amazon); Rohit Paturi (AWS AI Labs); Sravan Babu Bodapati (Amazon); Srikanth Ronanki (Amazon)
337 Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers Zijian Yang (Lehrstuhl fuer Informatik 6, RWTH Aachen)*; Wei Zhou (Chair of Computer Science 6, RWTH Aachen University); Ralf Schlüter (RWTH Aachen University); Hermann Ney (RWTH Aachen University)
339 TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for PyTorch Jeff Hwang (Meta)*; Moto Hira (Meta); Caroline Chen (Meta); Xiaohui Zhang (Meta); Zhaoheng Ni (Meta AI); Guangzhi Sun (University of Cambridge Department of Engineering); Pingchuan Ma (Meta); Ruizhe Huang (Johns Hopkins University); Vineel Pratap (Facebook); Yuekai Zhang (NVIDIA); Anurag Kumar (Facebook Reality Labs); Chin-Yun Yu (Queen Mary University of London); Chuang Zhu (NVIDIA); Chunxi Liu (Two Sigma); Jacob D Kahn (Facebook AI Research); Mirco Ravanelli (Université de Montréal); Peng Sun (NVIDIA); Shinji Watanabe (Carnegie Mellon University); Yangyang Shi (Facebook); Yumeng Tao (Meta)
340 Deriving Translational Acoustic Sub-Word Embeddings Amit Meghanani (University of Sheffield)*; Thomas Hain (University of Sheffield)
342 A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability JIAN XUE (Microsoft Corporation)*; Peidong Wang (Microsoft); Jinyu Li (Microsoft); eric sun (Microsoft)
343 Transferring Speech-Generic and Depression-Specific Knowledge for Alzheimer's Disease Detection Ziyun Cui (Tsinghua University)*; Wen Wu (University of Cambridge); Chao Zhang (Tsinghua University); Wei-Qiang Zhang (Tsinghua University); Ji Wu (Tsinghua University)
344 Robust Logarithmic Champernowne Algorithm for Feedback Cancellation in Hearing Aids Vanitha Devi R (National Institute of Technology Warangal)*; Vasundhara . (NIT Warangal)
347 Hierarchical Attention-Based Contextual Biasing for Personalized Speech Recognition Using Neural Transducers Sibo Tong (Amazon)*; Philip Harding (Amazon Alexa); Simon Wiesler (Amazon)
352 E3 TTS: Easy End-To-End Diffusion-Based Text To Speech Yuan Gao (Google)*; Nobuyuki Morioka (Google); Yu Zhang (Google); Nanxin Chen (Google)
354 Building High-Accuracy Multilingual ASR with Gated Language Experts And Curriculum Training eric sun (Microsoft)*; Jinyu Li (Microsoft); Yuxuan Hu (Microsoft); Yimeng Zhu (Microsoft); Long Zhou (Microsoft Research Asia); JIAN XUE (Microsoft Corporation); Peidong Wang (Microsoft); Linquan Liu (Microsoft); Shujie Liu (Microsoft Research Asia); Ed C Lin (Microsoft); Yifan Gong (Microsoft)
360 FLAP: Fast Language-Audio Pre-Training Ching-Feng Yeh (Facebook AI Research)*; Po-Yao Huang (Facebook AI Research); Vasu Sharma (Facebook AI Research); Shang-Wen Li (FAIR); Gargi Ghosh (Facebook AI Research)
361 On The Relevance Of Phoneme Duration Variability Of Synthesized Training Data For Automatic Speech Recognition Nick Rossenbach (RWTH Aachen University / AppTek GmbH)*; Benedikt Hilmes (HLT); Ralf Schlüter (RWTH Aachen University)
362 Enabling Noisy Label Usage for Out-Of-Airspace Data in Read-Back Error Detection Lakshmi Rajendram Bashyam (ZBW - Leibniz-Informationszentrum Wirtschaft); Alexander Blatt (Saarland University)*; Dietrich Klakow (Saarland University)
367 Enhancing Expressivity Transfer in Textless Speech-to-Speech Translation jarod duret (LIA)*; Benjamin O'Brien (LIA - Avignon University); Yannick Estève (LIA - Avignon University); Titouan Parcollet (Samsung AI Research)
368 Dialect Adaptation and Data Augmentation for Low-Resource ASR: Team XYZ Systems for the MADASR 2023 Challenge Tanel Alumae (Tallinn University of Technology)*; Jiaming Kong (Tallinn University of Technology); Daniil Robnikov (Tallinn University of Technology)
370 Reducing the Cost of Spoof Detection Labeling Using Mixed-Strategy Active Learning and Pretrained Models Mark R Lindsey (Carnegie Mellon University)*; Nathaniel R Robinson (Carnegie Mellon University); Francis Kubala (Probity, Inc.); Richard M Stern (Carnegie Mellon University)
373 A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction Kohei Saijo (Waseda University)*; Wangyou Zhang (Shanghai Jiao Tong University); Zhong-Qiu Wang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Tetsunori Kobayashi (Waseda University); Tetsuji Ogawa (Waseda University)
375 Two-Pass Endpoint Detection for Speech Recognition Anirudh Raju (Amazon Alexa); Di He (Amazon); Aparna Khare (Amazon)*; Ilya Sklyar (Amazon); Long Chen (Amazon); Viet Anh Tranh (Amazon); Zhe Zhang (Amazon); Colin Vaz (Amazon); Sam Alptekin (Amazon); Venkatesh Ravichandran (Amazon); Roland Maas (Amazon Inc.); Ariya Rastrow (Amazon Alexa)
379 Improved Long-Form Speech Recognition by Jointly Modeling The Primary and Non-Primary Speakers Guru Prakash Arumugam (Google LLC)*; Shuo-yiin Chang (Google); Tara Sainath (Google); Rohit Prabhavalkar (Google); Quan Wang (Google); Shaan Bijwadia (Google)
380 Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing Wangyou Zhang (Shanghai Jiao Tong University)*; Lei Yang (Samsung Research China – Beijing); Yanmin Qian (Shanghai Jiao Tong University)
381 Joint Energy-Based Model for Robust Speech Classification System against Dirty-Label Backdoor Poisoning Attacks Martin Sustek (Brno University of Technology)*; Sonal Joshi (Johns Hopkins University); Henry Li Xinyuan (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Jesus Antonio Villalba (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University); Najim Dehak (Johns Hopkins University)
382 Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR Sheikh Shams Azam (Apple)*; Tatiana Likhomanenko (Apple); Martin Pelikan (Apple); Jan Silovsky (Apple)
383 Joint Audio and Speech Understanding Yuan Gong (Massachusetts Institute of Technology)*; Alexander H Liu (MIT); Hongyin Luo (MIT); Leonid Karlinsky (IBM-Research); James Glass (Massachusetts Institute of Technology)
385 No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation Dennis Fucci (Fondazione Bruno Kessler)*; Marco Gaido (Fondazione Bruno Kessler); Matteo Negri (Fondazione Bruno Kessler); Mauro Cettolo (Fondazione Bruno Kessler); Luisa Bentivogli (Fondazione Bruno Kessler)
387 Improving Audiovisual Active Speaker Detection in Egocentric Recordings with the Data-efficient Image Transformer Jason Clarke (University of Sheffield)*; Yoshihiko Gotoh (University of Sheffield); Stefan Goetze (University of Sheffield)
391 YODAS: YouTube-Oriented Dataset for Audio and Speech Xinjian Li (Carnegie Mellon University)*; Shinnosuke Takamichi (The University of Tokyo); Takaaki Saeki (The University of Tokyo); William Chen (Carnegie Mellon University); Sayaka Shiota (Tokyo Metropolitan University); Shinji Watanabe (Carnegie Mellon University)
392 Discriminative Speech Recognition Rescoring With Pre-trained Language Models Prashanth Gurunath Shivakumar (Amazon)*; Jari T Kolehmainen (Amazon); Yile Gu (Amazon.com, USA); Ankur Gandhe (Amazon Alexa); Ariya Rastrow (Amazon Alexa); Ivan Bulyko (Amazon)
394 Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection Jiachen Lian (University of California Berkeley)*; Carly Z Feng (University of California, Berkeley); Naasir S Farooqi (UC Berkeley); Steve Li (Berkeley Speech Group); Anshul P Kashyap (UC Berkeley); Cheol Jun Cho (UC Berkeley); Peter Wu (UC Berkeley); Robin Netzorg (UC Berkeley); Tingle Li (UC Berkeley); Gopala Krishna Anumanchipalli (UC Berkeley)
395 MiniSUPERB: Lightweight Benchmark for Self-Supervised Speech Models Yu-Hsiang Wang (National Taiwan University)*; Huang-Yu Chen (National Taiwan University); Kai-Wei Chang (National Taiwan University); Winston H. Hsu (National Taiwan University); Hung-yi Lee (National Taiwan University)
399 MASR: Multi-Label Aware Speech Representation Learning ANJALI RAJ (Google); Shikhar Bharadwaj (Google); Sriram Ganapathy (Google); Min Ma (Google Research); Shikhar Vashishth (Google)*
403 A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 Ryuichi Yamamoto (LINE Corp.)*; Reo Yoneyama (Nagoya University); Lester Phillip G Violeta (Nagoya University); Wen-Chin Huang (Nagoya University); Tomoki Toda (Nagoya University)
404 Extending Self-distilled Self-supervised Learning for Semi-supervised Speaker Verification Jeong-Hwan Choi (Hanyang University); Jehyun Kyung (Hanyang University); Ju-seok Seong (Hanyang University); Ye-Rin Jeoung (Hanyang University); Joon-Hyuk Chang (Hanyang University)*
406 Pseudo-label based Supervised Contrastive Loss for Robust Speech Representations Varun Krishna PS Krishna (Indian Institute of Science)*; Sriram Ganapathy (Indian Institute of Science, Bangalore, India, 560012)
409 Audio-Visual Neural Syntax Acquisition Cheng-I Lai (MIT)*; Haoyue Shi (Toyota Technological Institute at Chicago); Puyuan Peng (The University of Texas at Austin); Yoon Kim (MIT); Kevin Gimpel (Toyota Technological Institute at Chicago); Shiyu Chang (UCSB); Yung-Sung Chuang (MIT); Saurabhchand Bhati (Johns Hopkins University); David Cox (MIT-IBM Watson AI Lab); David Harwath (The University of Texas at Austin); Yang Zhang (IBM T. J. Watson Research); Karen Livescu (TTI-Chicago); James Glass (Massachusetts Institute of Technology)
411 Improving Speech Enhancement Using Audio Tagging Knowledge From Pre-Trained Representations And Multi-Task Learning Shaoxiong Lin (Shanghai Jiao Tong University); Chao Zhang (Tsinghua University); Yanmin Qian (Shanghai Jiao Tong University)*
412 BA-MoE: Boundary-Aware Mixture-Of-Experts Adapter For Code-Switching Speech Recognition Peikun Chen (Northwestern Polytechnical University)*; Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yuhao Liang (Northwestern Polytechnical University); Hongfei Xue (NWPU); Xuchen Wan (Huawei Technologies Co., Ltd.); Naijun Zheng (Huawei Technologies Co., Ltd.); zhou huan (AARC, Huawei Technologies Co., Ltd.); Lei Xie (Northwestern Polytechnical University)
413 Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning Fine-tuning Yung-Chieh Chan (MediaTek Research); Feng-Ting Liao (MediaTek Research)*; Chan-Jan Hsu (MediaTek Research); Yi-Chang Chen (MediaTek Research); Da-shan Shiu (MediaTek Research)
414 Few-Shot Spoken Language Understanding via Joint Speech-Text Models Chung-Ming Chien (Toyota Technological Institute at Chicago)*; Mingjiamei Zhang (University of Chicago); Ju-Chieh Chou (TTIC); Karen Livescu (TTI-Chicago)
415 Summarize while Translating: Universal Model with Parallel Decoding for Summarization and Translation Takatomo Kano (NTT Corporation)*; Atsunori Ogawa (NTT Corporation); Marc Delcroix (NTT); Kohei Matsuura (NTT); Takanori Ashihara (NTT Corp.); William Chen (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)
417 Acoustic Model Fusion for End-To-End Speech Recognition Zhihong Lei (Apple); Mingbin Xu (Apple Inc.)*; Shiyi Han (Apple); Leo Liu (Apple); Zhen Huang (Apple); Tim Ng (Apple); Yuanyuan Zhang (Apple); Ernest Pusateri (Apple Inc.); Mirko Hannemann (Apple); Yaqiao Deng (Apple); Man-Hung Siu (Apple)
420 Domain Adaptation by Data Distribution Matching via Submodularity for Speech Recognition Yusuke Shinohara (Yahoo Japan Corporation)*; Shinji Watanabe (Carnegie Mellon University)
422 The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2Met 2.0): A Benchmark for Speaker-Attributed ASR Yuhao Liang (Northwestern Polytechnical University)*; Mohan Shi (University of Science and Technology of China); Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yangze Li (Northwestern Polytechnical University); Shiliang Zhang (Alibaba Group); Zhihao Du (Speech Lab of DAMO Academy, Alibaba Group); Lei Xie (Northwestern Polytechnical University); Yanmin Qian (Shanghai Jiao Tong University); Jian Wu (Microsoft); Zhuo Chen (Microsoft); Kong Aik Lee (ICT Cluster, Singapore Institute of Technology); Zhijie Yan (Alibaba Inc.); Hui Bu (AISHELL)
423 SLM: Bridging The Thin Gap Between Speech and Text Foundational Models Mingqiu Wang (Google Inc)*; Wei Han (Google); Izhak Shafran (Google AI); Zelin Wu (Google LLC); Chung-Cheng Chiu (Google); Yuan Cao (Google Brain); Nanxin Chen (Google); Yu Zhang (Google); Hagen Soltau (Google); Paul Rubenstein (Google); Lucas Zilka (Google); Dian Yu (Google); Golan Pundak (Google); Nikhil Siddhartha (Google.com); Johan Schalkwyk (Google); Yonghui Wu (Google)
424 An Exploration of Task-decoupling on Two-stage Neural Post Filter for Real-time Personalized Acoustic Echo Cancellation Zihan Zhang (Northwestern Polytechnical University)*; Jiayao Sun (Northwestern Polytechnical University); Xianjun Xia (RTC Lab, ByteDance); Ziqian Wang (Northwestern Polytechnical University); Xiaopeng Yan (Northwestern Polytechnical University); Yijian Xiao (ByteDance); Lei Xie (Northwestern Polytechnical University)
425 Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation Zhaofeng Lin (Multimedia Computing Group, Delft University of Technology); Tanvina Patel (Multimedia Computing Group, Delft University of Technology); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology)*
426 Leveraging The Multilingual Indonesian Ethnic Languages Dataset in Self-Supervised Model for Low-Resource ASR Task Sakriani Sakti (Japan Advanced Institute of Science and Technology)*; Benita Angela Titalim (JAIST)
429 PromptSpeaker: Speaker Generation Based on Text Descriptions yongmao zhang (Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China)*; Guanghou Liu (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Yunlin Chen (mobvoi); Hao Yin (mobvoi); Lei Xie (NWPU); Zhifei Li (Mobvoi)
430 HiGNN-TTS : Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS Dake Guo (Northwestern Polytechnical University)*; Xinfa Zhu (Northwestern Polytechnical University); Liumeng Xue (The Chinese University of Hong Kong, Shenzhen); Tao Li (School of Computer Science, Northwestern Polytechnical University, Xi’an); Yuanjun Lv (Northwestern Polytechnical University); Yuepeng Jiang (Northwestern Polytechnical University); Lei Xie (NWPU)
433 Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis Yuke Li (Audio, Speech and Language Processing Group (ASLP@NPU))*; Xinfa Zhu (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Hai Li (iQIYI Inc); Junhui Liu (iQIYI Inc); Danming Xie (iQIYI); Lei Xie (NWPU)
435 VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling Ziqian Ning (Northwestern Polytechnical University)*; Yuepeng Jiang (Northwestern Polytechnical University); Bin Zhang (Tencent Music Entertainment Group(TME)); Lei Xie (NWPU); Zhichao Wang (Northwestern Polytechnical University)
436 SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation Yuanjun Lv (Northwestern Polytechnical University)*; Jixun Yao (Northwestern Polytechnical University); Peikun Chen (Northwestern Polytechnical University); Hongbin Zhou (Ximalaya Inc.); Heng Lu (Ximalaya Inc.); Lei Xie (Northwestern Polytechnical University)
440 MUST: A Multilingual Student-Teacher Learning Approach for Low-Resource Speech Recognition Muhammad Umar Farooq (University of Sheffield)*; Rehan Ahmad (University of Sheffield); Thomas Hain (University of Sheffield)
441 WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer Takuma Okamoto (National Institute of Information and Communications Technology)*; Haruki Yamashita (Kobe University); Yamato Ohtani (National Institute of Information and Communications Technology); Tomoki Toda (Nagoya University); Hisashi Kawai (NICT)
445 Spectral Tilt May Have a Smaller Impact on The Intelligibility of Speech in Noise Yoshiki Sato (University of Aizu)*; Julián Villegas (University of Aizu)
447 H_Eval: A New Hybrid Evaluation Metric For Automatic Speech Recognition Tasks Zitha Sasindran (Indian Institute of Science)*; Harsha Yelchuri (Information Science Engineering RV College of Engineering Bengaluru, India); Prabhakar Venkata Tamma (Electronics Systems Engg); Supreeth Rao (Indian Institute of Science)
453 Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments Anusha Prakash (Indian Institute of Technology Madras)*; S Umesh (IIT Chennai); Hema A Murthy (IIT Madras)
456 Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition Geoffroy Vanderreydt (IDLab)*; Amrutha Prasad (Idiap Research Institute); Srikanth Madikeri (Idiap); Driss Khalil (Idiap Research Institute); Kris Demuynck (Ghent University); Petr Motlicek (Idiap)
468 Semi-Supervised Multi-Channel Speaker Diarization with Cross-Channel Attention Shilong Wu (University of Science and Technology of China)*; Jun Du (University of Science and Technology of China); Mao-Kui He (University of Science and Technology of China); Shutong Niu (University of Science and Technology of China); Hang Chen (USTC); Haitao Tang (iFLYTEK Research); Chin-hui Lee (Georgia Institute of Technology)
473 Gated Multi Encoders and Multitask Objectives For Dialectal Speech Recognition in Indian Languages Sathvik Udupa (Indian Institute of Science)*; Jesuraj Bandekar (IISc); Deekshitha G (IISc); Saurabh Kumar (IISc Bengaluru); Prasanta Ghosh; Sandhya Badiger (IISc Bangalore); Abhayjeet Singh (Indian Institute of Sciences, Bangalore, India); Savitha S Murthy (IISc); Priyanka Pai (Navana Tech, Mumbai); Srinivasa Raghavan (Navana Tech, Mumbai); Rohan Saxena (Navana Tech, Mumbai)
478 VITS-Based Singing Voice Conversion System with DSPGAN Post-Processing for SVCC2023 yiquan zhou (xjtu)*; Chen Meng (TME); Yi Lei (Northwestern Polytechnical University); Jihua Zhu (Xi'an Jiaotong University); weifeng zhao (tencent)